Segmentation of Handwritten Characters for Digitalizing Korean Historical Documents
Authors
Abstract
Historical documents are valuable cultural heritage and sources for the study of the history, society, and daily life of their time. The digitalization of historical documents aims to give researchers and the public instant access to archives whose availability has so far been limited for preservation reasons. However, most of these documents are not only handwritten in ancient Chinese characters but also have complex page layouts. As a result, conventional OCR (optical character recognition) systems are hard to apply to historical documents, even though OCR has for several years received the most attention as a key module in digitalization. We have been developing an OCR-based digitalization system for historical documents for years. In this paper, we propose dedicated segmentation and rejection methods for OCR of Korean historical documents. The proposed recognition-based segmentation method uses geometric features and context information with the Viterbi algorithm. The rejection method uses the Mahalanobis distance and posterior probability, in particular to address the out-of-class problem. Some promising experimental results are reported.
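As a rough illustration of how a rejection rule can combine the Mahalanobis distance with a posterior probability, here is a minimal sketch. It is not the authors' method: the diagonal covariance, the equal class priors, the thresholds `max_dist` and `min_posterior`, and all function names are assumptions made for the example.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance assuming a diagonal covariance (a simplification)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def classify_with_rejection(x, classes, max_dist=3.0, min_posterior=0.6):
    """Return the best class label, or None when the sample is rejected.

    classes maps label -> (mean vector, per-dimension variances).
    max_dist and min_posterior are hypothetical thresholds.
    """
    dists = {}
    log_liks = {}
    for label, (mean, var) in classes.items():
        d = mahalanobis_diag(x, mean, var)
        dists[label] = d
        # Gaussian log-likelihood up to a constant shared by all classes.
        log_liks[label] = -0.5 * d * d - 0.5 * sum(math.log(v) for v in var)

    # Posterior under equal priors: normalized likelihoods (stable softmax).
    top = max(log_liks.values())
    weights = {k: math.exp(v - top) for k, v in log_liks.items()}
    best = max(weights, key=weights.get)
    posterior = weights[best] / sum(weights.values())

    # Reject samples far from every class (out-of-class) or ambiguous ones.
    if dists[best] > max_dist or posterior < min_posterior:
        return None
    return best
```

The distance test is what catches out-of-class inputs: a sample far from every class model can still have a posterior close to 1 for the nearest class, so the posterior alone would accept it.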
Similar resources
Processing and Recognition of Handwritten Documents
Nowadays, the accurate recognition of machine printed characters is considered largely a solved problem. A lot of commercial products are focused towards that direction, achieving high recognition rates. However, handwritten character recognition is comparatively difficult. So, the recognition of handwritten documents is still a subject of active research. In this thesis we studied the processi...
In Codice Ratio: Scalable Transcription of Historical Handwritten Documents
Huge amounts of handwritten historical documents are being published by digital libraries worldwide. However, for these raw digital images to be really useful, they need to be annotated with informative content. State-of-the-art Handwritten Text Recognition (HTR) approaches require an impressive training effort by expert paleographers. Our contribution is a scalable, end-to-end transcription w...
Segmentation Based Optical Character Recognition for Handwritten Marathi characters
Valuable ancient documents such as historical books and old scripts are available in specific regional languages. Problems occur when those documents have to be preserved in digital form or modified. Optical Character Recognition is used to convert the scanned document to Word, Notepad or any other format, so that we can easily edit that document. A complete OCR system for handwritten Devanaga...
Text Extraction from Historical Handwritten Documents by Edge Detection
Many national archives or libraries keep large amounts of historical handwritten documents. One problem that many archivists are facing is the seeping of ink through the pages of certain double-sided handwritten documents after long periods of storage. The result is that the handwritten characters from the reverse side appear as noise on the front side and even interfere with the front side char...
Text line and word segmentation of handwritten documents
In this paper, we present a methodology for segmenting handwritten documents into their distinct entities, namely, text lines and words. Text line segmentation is achieved by applying the Hough transform on a subset of the document image connected components. A post-processing step includes the correction of possible false alarms, the detection of text lines that Hough transform failed to create and...